STAT 331 Final Project

Author

Gabe Riedel, Lena Kimura, Diego Jara, Drew Cloughley

Data Set-Up and Cleaning

Code
set.seed(1245124)
library(tidyverse)
library(knitr)
library(gganimate)
library(gifski)
library(kableExtra)
library(glue)
library(broom)
library(bibtex)
library(patchwork)
Code
income <- read_csv("data/avg_daily_income.csv")
women <- read_csv("data/mean_years_in_school_women_25_34_years.csv")

income_longer <- income |> 
  pivot_longer(cols = `1800`:`2100`, 
               names_to = "Year", 
               values_to = "Income")

women_longer <- women |> 
  pivot_longer(cols = `1970`:`2015`, 
               names_to = "Year", 
               values_to = "years_in_school")
Code
full_data <- income_longer |> 
  inner_join(women_longer, by = join_by(Year, country))

full_data <- full_data |> 
  mutate(Region = fct_collapse(factor(country),
    "North America" = c("Canada", "USA", "Mexico"),
    "South America" = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", 
                        "Guyana", "Paraguay", "Peru", "Suriname", "Uruguay", "Venezuela"),
    "Central America" = c("Antigua and Barbuda", "Bahamas", "Barbados", "Cuba", "Dominica", 
                          "Dominican Republic", "Grenada", "Haiti", "Jamaica", "St. Lucia", 
                          "St. Vincent and the Grenadines", "Trinidad and Tobago", "Belize", 
                          "Costa Rica", "El Salvador", "Guatemala", "Honduras","Nicaragua", 
                          "Panama"),
    "Europe" = c("Andorra","Austria", "Belgium", "France", "Germany", "Luxembourg",
                 "Netherlands", "Switzerland", "UK", "Belarus", "Bulgaria", "Croatia", 
                 "Czech Republic", "Estonia", "Hungary", "Latvia", "Lithuania", "Moldova",
                 "Poland", "Romania", "Russia", "Slovak Republic", "Ukraine", "Albania", 
                 "Bosnia and Herzegovina", "Cyprus", "Greece", "Italy", "Malta", "Montenegro",
                 "Portugal", "Serbia", "Slovenia", "Spain", "North Macedonia", "Denmark",
                 "Finland", "Iceland", "Ireland", "Norway", "Sweden"),
    "Middle East and North Africa" = c("Bahrain", "Egypt", "Iran", "Iraq", "Israel", 
                                       "Palestine", "Jordan", "Kuwait", "Lebanon", "Oman",
                                       "Qatar", "Saudi Arabia", "UAE", "Yemen", "Turkey",
                                       "Syria", "Armenia", "Azerbaijan", "Georgia", 
                                       "Afghanistan",  "Libya",  "Morocco",  "Pakistan",  
                                       "Somalia",  "Tunisia"),
    "Asia" = c("Bangladesh", "Bhutan", "India", "Maldives", "Nepal",
               "Sri Lanka", "Uzbekistan", "Turkmenistan", "Tajikistan", "Kyrgyz Republic",
               "Kazakhstan", "Brunei", "Cambodia", "Indonesia", "Lao", "Malaysia",
               "Myanmar", "Philippines", "Singapore", "Thailand", "Timor-Leste", "Vietnam",
               "China", "Japan", "North Korea", "South Korea", "Mongolia", "Taiwan"),
    "Oceania" = c("Australia", "Fiji", "Kiribati", "Marshall Islands", "New Zealand", 
                  "Papua New Guinea", "Solomon Islands", "Tonga", "Vanuatu", 
                  "Micronesia, Fed. Sts.", "Samoa"),
    "Sub-Saharan Africa" = c("Algeria", "Benin", "Burkina Faso", "Cape Verde", "Chad", 
                             "Comoros", "Cote d'Ivoire", "Djibouti", "Eritrea", "Gambia", 
                             "Ghana", "Liberia", "Mali", "Mauritania", "Mauritius","Niger", 
                             "Nigeria","Burundi", "Cameroon", "Central African Republic", 
                             "Congo, Dem. Rep.", "Congo, Rep.", "Ethiopia", "Equatorial Guinea",
                             "Gabon", "Guinea", "Guinea-Bissau", "Kenya", "Rwanda",
                             "Seychelles","South Sudan", "Namibia", "Uganda", 
                             "Sao Tome and Principe", "Senegal", "Sierra Leone", "Sudan", 
                             "Togo","Angola", "Botswana", "Eswatini", "Lesotho", "Madagascar", 
                             "Malawi", "Mozambique", "South Africa", "Tanzania", "Zambia", 
                             "Zimbabwe")
  )) 

Introduction

Data/Variable Description

In this analysis, we will use two variables from Gapminder: (Rosling 2017) average daily income, and average years in school of women ages 25-34. Our first dataset includes information on the average daily income in 2017 dollars from the year 1800 to 2100 (projected). Our second dataset includes the mean number of years enrolled in school for women between the ages 25-34 from 1970 to 2015. Each observation is uniquely identified by a different country. Our analysis will include the years shared by both datasets (1970-2015) and the shared countries (188 in common). Our datasets did not have any missing values or inconsistent data types. In addition, we decided to divide all of the countries in the dataset into 8 distinct regions in alignment with the “Official Listing of Countries by World Region: The Eight Groupings of the World by Location and Culture.” (Rosenberg August 28, 2024)

Hypothesized Variable Relationship

We hypothesize that nations with women between ages 25 and 34 that have spent more years in school will have a higher average daily income because women in this age range may have also completed graduate-level education.

Data Visualization

Code
full_data |> 
  group_by(country) |> 
  mutate(avg_income = mean(Income), 
         avg_school = mean(years_in_school)) |>
  ggplot(aes(x = avg_school, y = avg_income, color = Region)) +
  geom_point() +
  labs(x = "Number of Years in School", 
       y = "Income", 
       title = "Income and Number of Years in School by Women across the World",
       subtitle = "Income and School Years were Averaged across Country") +
  theme_bw()

In the scatterplot above, every dot on the graph represents a different country where they are colored by their corresponding region. The x-axis represents the average number of years that women are in school, where the y-axis displays the average daily income for a household. To properly display all the data from the years 1970 to 2015, the average of both variables was taken respective to the country and placed on the scatterplot above.

Code
# animation functions found in R graph gallery page
# https://r-graph-gallery.com/package/gganimate.html

animation <- full_data |> 
  mutate(Year = as.numeric(Year)) |> 
  ggplot(aes(x = years_in_school, y = Income, color = Region)) +
  geom_point() +
  labs(x = "Number of Years in School", 
       y = "Income", 
       title = "Income and Number of Years in School by Women across the World 1970-2015",
       subtitle = "Income and School Years across Country") +
  theme_bw()

anim <- animation + transition_time(Year) +
  ease_aes("linear")

anim

Linear Regression

\[Log_{10}(MeanIncome) = 0.3088 + 0.0959*MeanEducation \]

Because we took the base-10 logarithm of the mean income by country, a slope of 0.0959 indicates that income increases by estimate of 24.71% for every additional year of education for women ages 25-54. At 0 years of education, average income is approximately $2.04.

Code
regression_model_data <- 
  full_data |> 
  group_by(country) |> 
  summarize(mean_inc = mean(Income),
            mean_educ = mean(years_in_school))

lm_income_educ <- lm(log10(mean_inc) ~ mean_educ, data = regression_model_data)
kable(tidy(lm_income_educ))
term estimate std.error statistic p.value
(Intercept) 0.3088025 0.0454792 6.789973 0
mean_educ 0.0959463 0.0058351 16.442946 0

The \(Log_{10}\) transformation of average daily income helped to linearize our model shown in the scatterplot below. Like the figure in 2.1, each dot on the graph represents a different country. The x-axis represents the average number of years that women are in school, where the y-axis displays the \(Log_{10}\) of the average daily income for a household. To properly display all the data from the years 1970 to 2015, the average of both variables was taken respective to the county and placed on the scatterplot above.

Code
regression_model_data |> 
  ggplot(aes(x = mean_educ, 
             y = log10(mean_inc))) +
  geom_point() + 
  geom_smooth(method = lm) +
  labs(x = "Number of Years in School", 
       y = "Log10(Income)",
       title = "Log Income and Number of Years in School for Women across the World",
       subtitle = "Income and number of years in school were averaged across country") +
  theme_bw()

Model Fit

Code
var_response <- var(log10(regression_model_data$mean_inc))
var_fitted <- var(lm_income_educ$fitted)
var_residuals <- var(lm_income_educ$residuals)

var_table <- data.frame(c("Variance in Response", "Variance in Fitted Values", "Variance in Residuals"),
                        c(var_response, var_fitted, var_residuals))

var_table |>
  kable(col.names = c("Metric", "Variance")) |>
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover")) |>
  add_header_above(c("Variability" = 2)) |>
  row_spec(0, bold = TRUE, color = "black")
Variability
Metric Variance
Variance in Response 0.1810507
Variance in Fitted Values 0.1072610
Variance in Residuals 0.0737897
Code
var_prop_explained <- round(var_fitted/var_response,4) * 100
var_prop_unexplained <- round(var_residuals/var_response,4) * 100
var_percent_explained <- glue("{var_prop_explained}%")
var_percent_unexplained <- glue("{var_prop_unexplained}%")

Our ‘Variability’ table shows the variance in our response, fitted, and residual values respectively. In order to find how much of the variability in our reponses is explained by the model, we divide the variance of the fitted values by the variance of the response values.

Doing this math, we find the percent of variability explained by the model is 59.24%. This means our model is accounting for a moderate amount of variability. There is still 40.76% unexplained by the model, so other factors outside of years in school are influencing this response. Overall, for simple linear regression, our model is of moderate to high quality.

Simulation

Visualizing Simulations from the Model

Code
model_predict <- predict(lm_income_educ)
est_sigma <- sigma(lm_income_educ)

rand_error <- function(x, mean = 0, sd){
  return(x + rnorm(length(x), mean, sd))
}

sim_response <- tibble(sim_income_pred = rand_error(x=model_predict, 
                                                    sd=est_sigma))
combined_data <- regression_model_data |> 
  select(mean_educ, mean_inc, country) |> 
  bind_cols(sim_response)
Code
sim_reg_p <- combined_data |>
  ggplot(aes(x = mean_educ, 
             y = sim_income_pred)) + 
  geom_point() + 
  geom_smooth(method=lm)+
  labs(x = "Number of Years in School", 
       y = "Log10(Income)",
       subtitle = "Income and years in school averaged \nacross country",
       title = "Simulated Log Income based \non Regression Model" ) + 
  theme_bw()

obs_reg_p <- combined_data |> 
  ggplot(aes(x = mean_educ, 
             y = log10(mean_inc))) +
           geom_point() + 
  geom_smooth(method = lm) +
  labs(x = "Number of Years in School", 
       y = "Log10(Income)",
       title = "Observed Log Income",
       subtitle = "Income and years in school averaged \nacross country") +
  theme_bw()

obs_reg_p + sim_reg_p

Overall, the two plots are largely similar. Both the observed and simulated data follow positive, linear slopes with log-adjusted income values from about 0 to 2. The observed data plot appears to have slightly more variablity in log-adjusted income values, and its regression line seems to be steeper than that of the simulated data regression line.

Generating Multiple Predicative Checks

Code
# functions for this section were obtained from the textbook
# https://manncz.github.io/stat331-calpoly-text/10-predictive-checks.html#ch10-checkins 

sims <- map_dfc(.x = 1:1000,
                .f = ~ tibble(sim = rand_error(model_predict, 
                                          sd = est_sigma)
                              )
                )
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

sims2 <- regression_model_data |> 
  filter(!is.na(mean_inc), 
         !is.na(mean_educ)) |> 
  mutate(log_mean_inc = log10(mean_inc)) |>
  select(log_mean_inc) |> 
  bind_cols(sims)

sim_r_sq <- sims2 |> 
  map(~ lm(log_mean_inc ~ .x, data = sims2)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)
sim_r_sq <- sim_r_sq[names(sim_r_sq) != "log_mean_inc"]
Code
tibble(sims = sim_r_sq) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models") +
  theme_bw()

Code
avg_var <- round(mean(sim_r_sq),2)*100

Our distribution of \(R^2\) values of simulated datasets is approximately normal, has values between 0.2 and 0.5, and is centered around 0.35. This suggests the data we simulated under this model have low to moderate similarity with our observed data. On average, our simulated data account for 35% of the variability in the observed log-adjusted income.

Conclusion

We explored the relationship between average daily income (in 2017 dollars) and average years in school for women ages 25-34 across 8 geographic regions, and averaged over 45 years. We performed a linear regression across all countries and log-transformed average income to linearize our dataset. Our model estimated a 24.71% increase in mean income for every additional year of education for women ages 25-54. Additionally, our linear model was able to explain 59.24% of the variance in our dataset, suggesting further predictors may better explain average income. When we generated a predictive model simulating our observed linear regression, we found it had a similar linear relationship and variability as our observed model. Our predictive checks ascribed a low to moderate relationship between our observed and predicted models. In conclusion, we found a moderate, positive correlation between number of years in school for women ages 25-34 and average daily income, which may benefit from exploring additional predictors of daily income.

References

Rosenberg, Matt. August 28, 2024. “Official Listing of Countries by World Region: The Eight Groupings of the World by Location and Culture,” August 28, 2024. https://www.thoughtco.com/official-listing-of-countries-world-region-1435153.
Rosling, Ola. 2017. “Gapminder.” https://www.gapminder.org/.